The data that I chose to use for this project is traffic crash data from the Chiacago data portal. The main traffic crash data (Traffic Crashes) has information about each traffic crash that occurred on city streets within the City of Chicago limits and under the jurisdiction of the Chicago Police Department (CPD). Data from all police districts available starting September 2017

This data can be found using the following link: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if

I also chose to look at the Traffic Crashes - People dataset which can be used with the main Traffic Crashes dataset and contains information about the people involved in a crash, whether there were any injuries, etc.

This data can be found using the following link: https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
traffic_crashes <- read.csv("/Users/lilykraus/Downloads/DSCI101/Traffic_Crashes.csv")

#Vehicle Data
traffic_crash_vehicles <-read.csv("/Users/lilykraus/Downloads/DSCI101/Traffic_Crashes_Vehicles.csv")

#People Data
traffic_crash_people <- read.csv("/Users/lilykraus/Downloads/DSCI101/Traffic_Crashes_People.csv")

#can join these data sets based on crash record ID
#all_crash_data <- traffic_crashes %>% left_join(traffic_crash_vehicles, by = c("CRASH_RECORD_ID" = "CRASH_RECORD_ID"))

#all_crash_data <- all_crash_data %>% left_join(traffic_crash_people, by = c("CRASH_RECORD_ID" = "CRASH_RECORD_ID"))

For the first part of this project, I chose to look at how the number of car crashes varied by year, month, day, and hour. I also wanted to see what the highest number of car crashes that occurred in a single day was.

#Select the columns related to date and time
just_time_date <- traffic_crashes %>% select(CRASH_DATE,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH)

#Specify the format of the CRASH_DATE data
just_time_date$CRASH_DATE <- as.POSIXct(just_time_date$CRASH_DATE, format = "%m/%d/%Y %I:%M:%S %p")

#Filter years - citywide data was not available until September 2017 and 2023 is still ongoing
just_time_date_filt <- just_time_date %>% mutate(crash_year = year(CRASH_DATE)) %>% filter(crash_year > 2017 & crash_year < 2023)

Looking at the number of crashes per year: I’ve decided to make two separate bar charts for the number of car crashes by year: One that includes all years 2015 - 2023 and one that includes the years 2018 - 2022. I chose to limit the second bar chart to the years 2018-2022 because these are the years for which the dataset has complete data. Citywide data was not available until September 2017, with only some data being reported in 2015. Therefore, I chose to exclude data from years prior to 2017 in my analysis. Additionally, the current year, 2023, is still ongoing and does not have complete data, so it was also excluded.

All years

#Number of crashes by year 
just_time_date %>% 
  mutate(crash_year = year(CRASH_DATE)) %>%
  filter(crash_year >= 2015) %>% 
  group_by(crash_year) %>%
  summarise(num_crashes = n()) %>%
  ggplot(aes(x = as.factor(crash_year),y = num_crashes)) + geom_bar(stat = "identity", fill = "lightseagreen") 

Years 2018 - 2022

Based on this bar chart, there was a considerable drop in the number car crashes from 2019 - 2020. This was probably due to the COVID-19 pandemic.

#Number of crashes by year 
just_time_date_filt %>% 
  group_by(crash_year) %>%
  summarise(num_crashes = n()) %>%
  ggplot(aes(x = as.factor(crash_year),y = num_crashes)) + geom_bar(stat = "identity", fill = "lightseagreen") 

Looking at the number of crashes by month: For this analysis, I chose to look at the data between the years 2018-2022, since these are the years with complete data.

It looks like there’s the greatest proportion of crashes occurring in July and October and the least in April. It looks like there are more crashes occurring in the summer/fall months than the winter/spring months.

#Number of crashes by month
just_time_date_filt %>%
  mutate(month_name = month(CRASH_MONTH, label = TRUE, abbr = FALSE)) %>%
  group_by(month_name) %>%
  summarise(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = (month_name),y = proportion_crashes)) +
  geom_bar(stat = "identity", fill = "darkorange2") +
  xlab("Month") +
  ylab("Proportion of Crashes") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Looking at the number of crashes by day of the week: The greatest proportion of crashes occur on Fridays and the least on Sundays

#Number of crashes by day
just_time_date_filt %>%
  mutate(day_name = wday(CRASH_DAY_OF_WEEK,label = TRUE)) %>%
  group_by(day_name) %>%
  summarize(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = day_name,y = proportion_crashes)) +
  geom_bar(stat = "identity", fill = "darkslateblue") +
  xlab("Day of the Week") +
  ylab("Proportion of Car Crashes")

Looking at the number of crashes by hour: The data shows that the smallest proportion of crashes occurs at 4 AM. However, there are spikes in the proportion of crashes around 8 AM and 3-4 PM. These spikes could be attributed to the fact that many people are driving to and from work at those times, resulting in more traffic and a higher likelihood of crashes.

#Number of crashes by hour

just_time_date_filt$crash_time <- format(just_time_date_filt$CRASH_DATE, format = "%I%p")

#Keeps the original hours from the crash_time column
#Sys.Date returns current date in YYYY-MM-DD
#paste0() is used to concatenate the current date with the hours from the crash_time column
#create a new column that contains the current system date and the hours from the crash_time column. This effectively "sets" the date for each crash to the current system date, but keeps the original hours from the crash_time column.


just_time_date_filt %>% 
  mutate(crash_time = as.POSIXct(paste0(Sys.Date(), " ",crash_time), format = "%Y-%m-%d %I%p")) %>% 
  group_by(crash_time) %>% 
  summarise(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = crash_time, y = proportion_crashes)) +
geom_point(color = "slateblue4") + 
  geom_line(color = "slateblue2") +
  scale_x_datetime(date_labels = "%I%p", date_breaks = "1 hour") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  xlab("Time of Day") +
  ylab("Proportion of Crashes")

#just_time_date_filt %>% group_by(CRASH_HOUR) %>% summarise(num_crashes = n()) %>% ggplot(aes(x = CRASH_HOUR, y = num_crashes)) + geom_point(color = "slateblue4") + geom_line(color = "slateblue2")

Largest number of crashes occurring in a single day?

583 on 01/12/2019

Smallest number of crashes occurring in a single day?

85 on 03/29/2020 - This was right around when the COVID-19 lockdown started in Chicago. It’s worth noting that the 20 days with the smallest number of crashes occurred around the beginning of the COVID-19 lockdown. This observation is interesting and could suggest a connection between the lockdown and a reduction in the number of crashes.

#What's the largest number of traffic crashes that has occurred in a single day?
just_time_date_filt$crash_date_notime <- format(just_time_date_filt$CRASH_DATE, format = "%m/%d/%Y")

just_time_date_filt %>% group_by(crash_date_notime) %>% summarize(num_crashes = n()) %>% arrange(-num_crashes) %>% head(10)
## # A tibble: 10 × 2
##    crash_date_notime num_crashes
##    <chr>                   <int>
##  1 01/12/2019                583
##  2 02/14/2020                567
##  3 11/12/2019                551
##  4 10/29/2021                512
##  5 12/10/2021                491
##  6 09/17/2021                473
##  7 06/01/2018                464
##  8 06/19/2021                460
##  9 12/09/2022                456
## 10 10/31/2019                454
just_time_date_filt %>% group_by(crash_date_notime) %>% summarize(num_crashes = n()) %>% arrange(-num_crashes) %>% tail(10)
## # A tibble: 10 × 2
##    crash_date_notime num_crashes
##    <chr>                   <int>
##  1 04/05/2020                124
##  2 03/28/2020                123
##  3 03/31/2020                120
##  4 03/26/2020                116
##  5 03/24/2020                113
##  6 04/06/2020                113
##  7 04/12/2020                113
##  8 04/09/2020                107
##  9 03/22/2020                 96
## 10 03/29/2020                 85

How do numbers of vehicle crashes change during the COVID-19 pandemic?

I added a vertical dotted line to show when lockdown started in Illinois. The graph reveals a sharp decrease in the number of car crashes occurring each day, reaching some of its lowest points soon after the onset of lockdown.

#How do numbers of vehicle crashes change during the COVID-19 pandemic?
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#Mads approves of this graph

#Define lockdown start date
lockdown_start <- as.Date("2020-03-21")

#convert CRASH_DATE column to date
just_time_date_filt$CRASH_DATE <- as.Date(just_time_date_filt$CRASH_DATE)

COVID_plt <- just_time_date_filt %>%
  filter(CRASH_DATE >= as.Date("2019-12-01") & CRASH_DATE <= as.Date("2020-12-01")) %>%
  group_by(CRASH_DATE) %>% 
  summarise(num_crashes = n()) %>%
  ggplot(aes(x = CRASH_DATE, y = num_crashes)) +
  geom_line() +
  scale_x_date(date_breaks = "months", date_labels = "%b-%Y") +
  xlab("Date") +
  ylab("Number of Crashes") +
  theme(axis.text.x = element_text(size = 8, angle = 45)) +
  geom_vline(xintercept = as.numeric(lockdown_start), color = "red",linetype = "dotted", size = 1,linewidth = 0.4)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(COVID_plt)

For the second part of this project, I focused on the traffic crash people dataset. One of my objectives was to determine the gender distribution of drivers involved in accidents, specifically the proportion of male and female drivers.

From this analysis, a larger proportion of drivers involved in car accidents were male.

#Using the people dataset, what proportion of drivers were male vs female?

traffic_crash_people %>%
  filter(SEX != "X" & SEX !="", PERSON_TYPE == "DRIVER") %>%
  group_by(SEX) %>%
  summarise(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = SEX,y = proportion_crashes)) +
  geom_bar(stat = "identity", fill = "deepskyblue4") +
  xlab("Sex") +
  ylab("Proportion of Crashes")

Does the proportion of crashes resulting in injury or towing vary depending on Sex? I will answer this question by joining the main traffic crash dataset with the people data set to see if there was an injury as indicated by the CRASH_TYPE column.

Based on this bar plot, it looks like out of all of the crashes resulting in tow or injury, a larger proportion of them involved male drivers.

#Join people data with main traffic crash data based on crash record ID
people_crashes_joined <- traffic_crash_people %>% left_join(traffic_crashes, by = c("CRASH_RECORD_ID" = "CRASH_RECORD_ID"))

people_crashes_joined %>%
  filter(CRASH_TYPE == "INJURY AND / OR TOW DUE TO CRASH",SEX != "X" & SEX !="",PERSON_TYPE == "DRIVER") %>%
  group_by(SEX) %>%
  summarise(num_crashes = n()) %>% 
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = SEX,y = proportion_crashes)) +
  geom_bar(stat = "identity", fill = "deepskyblue4") +
  xlab("Sex") +
  ylab("Proportion of Crashes Resulting in Tow or Injury")

Based on this graph, it appears that males experience a greater number of crashes resulting in injury and/or tow, relative to their total number of crashes, compared to females.

people_crashes_joined %>%
  select(PERSON_TYPE,SEX,CRASH_TYPE) %>% 
  filter(SEX != "X" & SEX !="",PERSON_TYPE == "DRIVER") %>%
  group_by(SEX,CRASH_TYPE) %>%
  summarise(num_crashes = n()) %>%
  ggplot(aes(x = SEX, y = num_crashes, fill = CRASH_TYPE)) +
  geom_bar(stat = "identity")
## `summarise()` has grouped output by 'SEX'. You can override using the `.groups`
## argument.

However, upon examining the proportions of crashes that resulted in injury and/or towing out of the total number of crashes for each sex, it appears that the proportion of crashes resulting in tow and/or injury is quite similar for both males and females.

people_crashes_joined %>%
  select(PERSON_TYPE,SEX,CRASH_TYPE) %>% 
  filter(SEX != "X" & SEX !="",PERSON_TYPE == "DRIVER") %>%
  ggplot(aes(x = SEX, fill = CRASH_TYPE)) +
  geom_bar(position = "fill") +
  ylab("Proportion of Crashes")

Next, I wanted to see how the number of vehicle crashes compared across different age groups.

I filtered the data to only include drivers with an age of 15 or older. This was done as 15 is the minimum age for obtaining a learner’s permit in Illinois, and it is unlikely for anyone younger than that to be legally driving on the roads. I also thought it was unlikely that there would be many people over 100 on the road.

The highest proportion of car crashes involve drivers in the age range 20-30 years old.

#Using the people dataset, how do vehicle crashes compare across different age groups?

people_agefilt <- traffic_crash_people %>% filter(AGE > 15 & AGE <= 100, PERSON_TYPE == "DRIVER", !is.na(AGE))

max_age = max(people_agefilt$AGE)

people_agefilt %>%
  mutate(age_group = cut(AGE,breaks = seq(0,max_age,by = 10))) %>%
  group_by(age_group) %>%
  summarise(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = age_group,y = proportion_crashes)) +
  geom_bar(stat = "identity", fill = "deepskyblue4") +
  xlab("Age Group") +
  ylab("Proportion of Crashes")

Now, how do number of crashes compare across different sexes and age groups?

Based on this graph, it looks like the highest proportion of crashes involve female drivers in the age range 20-30 years old.

people_agefilt %>%
  filter(SEX != "X" & SEX !="", AGE > 15, PERSON_TYPE == "DRIVER", !is.na(AGE)) %>%
  mutate(age_group = cut(AGE,breaks = seq(0,max_age,by = 10))) %>%
  group_by(SEX,age_group) %>%
  summarise(num_crashes = n()) %>%
  mutate(proportion_crashes = num_crashes/sum(num_crashes)) %>%
  ggplot(aes(x = age_group,y = proportion_crashes, fill = SEX)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Age Group") +
  ylab("Proportion of Crashes")
## `summarise()` has grouped output by 'SEX'. You can override using the `.groups`
## argument.

What are the most common primary contributory causes of car crashes?

The most common primary contributory causes of car crashes are failing to yield right-of-way and following too closely.

#What is the most common primary contributory cause of car crashes

traffic_crashes %>%
  filter(PRIM_CONTRIBUTORY_CAUSE != "NOT APPLICABLE" & PRIM_CONTRIBUTORY_CAUSE != "UNABLE TO DETERMINE") %>%
  group_by(PRIM_CONTRIBUTORY_CAUSE) %>%
  summarise(num_crashes= n()) %>%
  arrange(-num_crashes) %>%
  head(5) %>%
  ggplot(aes(x = PRIM_CONTRIBUTORY_CAUSE,y = num_crashes)) +
  geom_bar(stat = "identity", fill = "darkmagenta") +
  coord_flip() +
  xlab("Primary Contributory Cause") +
  ylab("Number of Crashes")

What are the most common crash types? Does this correspond to the top primary contributory causes of car crashes?

The two top crash types are rear end and parked motor vehicle. Rear end being a top crash type makes sense considering following too closely is a top primary contributory cause and I guess a lot of people just hit parked vehicles.

traffic_crashes %>%
  group_by(FIRST_CRASH_TYPE) %>%
  summarise(num_crashes= n()) %>%
  arrange(-num_crashes) %>%
  head(5) %>%
  ggplot(aes(x = FIRST_CRASH_TYPE,y = num_crashes)) +
  geom_bar(stat = "identity", fill = "darkmagenta") +
  coord_flip() +
  xlab("First Crash Type") +
  ylab("Number of Crashes")

For the last part of this project I wanted to make a map to help visualize how the number of car crashes varies by different neighborhoods.

From the map, Austin has the largest number of crashes. Englewood also has a large number of crashes.

library(ggplot2)
library(sf)
## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)
library(viridis)
## Loading required package: viridisLite
#Read in the Chicago neighborhoods shape file
chi_map <- read_sf("/Users/lilykraus/Downloads/Boundaries - Neighborhoods.geojson")

#This adds labels for the different neighborhoods
#neighborhoods_labeled <- ggplot(data=chi_map) + geom_sf() + geom_sf_text(aes(label = pri_neigh), size = 1)

#Filter out missing location 
no_missing_loc <- traffic_crashes %>% filter(!is.na(LATITUDE) & !is.na(LONGITUDE) & LONGITUDE != 0) %>% select(LATITUDE,LONGITUDE,LOCATION)

no_missing_loc %>% 
  st_as_sf(wkt = "LOCATION", crs = st_crs(chi_map)) %>% 
  #put point inside polygon
  st_join(chi_map) %>%
  #Put into regular table
  tibble() %>% 
  group_by(pri_neigh) %>%
  tally() %>%
  rename(num_crashes = n) %>%
  #join back with spatial data based on primary neighborhood name
  left_join(chi_map) %>% 
  #makes it an sf object
  st_as_sf() %>% 
  ggplot() +
  geom_sf(aes(fill = num_crashes)) +
  scale_fill_viridis()
## Joining with `by = join_by(pri_neigh)`